. \" PT - page titles
.de PT
.lt \\n(LTu
.pc %
.nr PN \\n%
.nr PT \\n%
.if \\n(P1 .nr PT 2
.if \\n(PT>1 .if !\\n(EH .if !\\n(OH .tl
.if \\n(PT>1 .if \\n(OH .if o .tl \\*(O1
.if \\n(PT>1 .if \\n(EH .if e .tl \\*(E2
.lt \\n(.lu
..
.EQ
delim $$
.EN
.nr LL 7.0i
.nr PO 0.75i
.nr PI 0.1i
.nr HM 0.75i
.nr FM 1.25i
.ps 9
.nr PS 9
.vs 10
.nr VS 10
.LP
.ce 1000
.LG
.LG
\fBQuartz: A Tool for Tuning Parallel Program Performance\fR
.sp 1.5
.SM
Thomas E. Anderson and Edward D. Lazowska
.sp 0.5
.SM
Department of Computer Science and Engineering
University of Washington
Seattle WA 98195
.sp 0.5
September 1989
.sp 2
.LG
\fBAbstract\fR
.SM
.ce 0
.sp 0.5
.PP
.nh
Initial implementations of parallel programs typically
yield disappointing performance.
Tuning to improve performance is thus a significant
part of the parallel programming process.
The effort required to tune a parallel program,
and the level of performance that eventually is
achieved, both depend heavily on the quality of
the instrumentation that is available to the programmer.
.PP
This paper describes Quartz, a new tool for tuning parallel
program performance on shared memory multiprocessors.
The philosophy underlying Quartz was inspired
by that of the sequential UNIX tool gprof: to appropriately
direct the attention of the programmer by efficiently measuring
just those factors that are most responsible for
performance and by relating these metrics to
one another and to the structure of the program.
This philosophy is even more important in the
parallel domain than in the sequential domain,
because of the dramatically greater number of
possible metrics and the dramatically
increased complexity of program structures.
.PP
The principal metric of Quartz is \fInormalized processor time\fR:
the total processor time spent
in each section of code divided by the number of other processors
that are concurrently busy when that section of code is being executed.
Tied to the logical structure of the program, this metric
provides a "smoking gun" pointing towards those areas of the
program most responsible for poor performance. This information can be
acquired efficiently by checkpointing to memory the number of busy
processors and the state of each processor, and then statistically
sampling these using a dedicated processor.
.\" *** I think it was much clearer the way it was.
.\" This information can be
.\" acquired efficiently by checkpointing to memory the number of busy
.\" processors and the state of each processor, and then statistically
.\" sampling these using a dedicated processor.
.\" Quartz computes a weighting factor for each procedure
.\" by dividing processor time by concurrency
.\" (summing over each element of the vector).
.\" Quartz ties this metric to the logical structure of the program
.\" by reporting it for each procedure and for
.\" all of the activity performed on that procedure's behalf,
.\" whether invoked synchronously via subroutine call or
.\" asynchronously through parallelization.
.\" The result is a "smoking gun" pointing towards those
.\" areas most responsible for poor performance.
.PP
In addition to describing the design rationale,
functionality, and implementation of Quartz, the paper
examines how Quartz would be used to solve a number of
performance problems that
have been reported as being frequently encountered,
and describes a case study in which Quartz was used
to significantly improve the performance of a CAD circuit
verifier.
.sp
.LP
.nh
\fIIndex Terms\fR \- multiprocessor, performance, measurement,
parallel programming, tuning
.sp 2
.MC 3.33i
.FS
.sp 0.5
This material is based on work supported by
the National Science Foundation
(Grants No. CCR-8619663, CCR-8703049,
and CCR-8700106),
the Naval Ocean Systems Center,
the Washington Technology Center,
Digital Equipment Corporation (the Systems
Research Center and the External Research Program),
and IBM (a Graduate Fellowship).
.sp
Authors' address: Department of Computer Science and Engineering FR-35,
University of Washington, Seattle WA 98195; (206) 545-2675;
\fItom/lazowska@cs.washington.edu\fR. Quartz source code is
available by anonymous ftp from cs.washington.edu.
.sp 0.75i
.FE
.LP
.NH
Introduction
.nr H1 1
.nr H2 0
.PP
.nh
The primary motivation behind building multiprocessors
is to cost-effectively improve system performance.
Even moderately increasing a
uniprocessor's power can require substantial additional design effort
as well as faster, and thus more expensive, hardware components.
By contrast, once a scheme for interprocessor communication
has been added to a uniprocessor design, the system's peak processing
power can be increased linearly simply by adding processors. The
incremental cost per processor has been reported to
be as little as 15% of the initial system cost for
small to moderate numbers of processors [Thacker et al. 1988],
and larger but still close to linear for greater numbers of processors
[BBN 1985; Pfister et al. 1985].
.PP
Of course, multiprocessors lose their advantage
if this processing power is not effectively utilized,
and while it is relatively easy
to get good performance when there are multiple independent sequential
job streams, it can be difficult to achieve good performance from
parallel applications.
The literature describes many attempts to parallelize algorithms and
applications.
(Burkhart and Millen [1989] survey some of this
experience.)\
Typically, an initial implementation results in disappointing
performance, but significant improvements can be obtained
with subsequent effort.
Sequent, for example, tells of a major customer whose first attempt
at parallelizing a "dusty deck" resulted in a program that,
given 8 processors,
executed only 50% as fast as the original sequential program. After
considerable effort by skilled engineers, nearly perfect
speedup (a factor of nearly 8 on an 8 processor machine) was achieved
[Rodgers 1986].
.PP
A major factor contributing to the large amount
of skilled effort typically required to achieve
good parallel program performance is
the shortage of good performance analysis
tools.
In the absence of such tools,
performance problems must be identified through a
combination of guesswork, folklore, and application-specific
instrumentation.
The subject of this paper is the design rationale,
functionality, implementation, and use of a new
tool for tuning parallel program performance.
.PP
The philosophy underlying our work is that
an effective tool for tuning parallel program performance
must be based
on a clear view of the causes of poor performance,
and on a specific methodology for improving that performance.
By being selective about what it measures and presents, the tool can
focus the programmer's attention on the information needed to tune
performance, eliding details about second-order effects.
Measurement efficiency also is improved by designing the tool to
record just the important behavior.
.PP
Selectivity is possible because,
although parallel performance in general is much more complex than
sequential performance, experience (discussed in Section 3.2) suggests that
poor parallel performance typically arises from a relatively small number
of factors. For applications whose performance is dominated
by periods of limited parallelism,
the tool should identify those sections
of code that account for most of this time
so that these sections can either be re-structured to increase concurrency
or optimized to reduce their impact on overall performance. Time
spent spin- (or "busy"-) waiting must be correctly represented, since
spinning processors appear
to be busy even though they are not computing useful results. Finally,
for applications with large amounts of real parallelism, performance can
only be improved by optimizing (but not further parallelizing) the code
that executes for the greatest proportion of time.
.PP
Based on these observations, we
propose a new way to view parallel program performance
on shared memory multiprocessors.
We focus on both the total processor time devoted to each
section of code and the
number of other processors concurrently busy (as opposed to idle or
spin-waiting) when that section of code is being executed.
Routines can be compared by considering their
\fInormalized processor time\fR: their processor time divided
by the concurrent parallelism (a precise definition is given in Section 3).
This metric usually reflects the relative importance of different
sections of code to the overall elapsed time of the program:
a routine that executes while no other processors are busy
can be responsible for a large percentage of the runtime of a program,
even though it uses only a small fraction of the total processor time.
Further, by measuring both parallelism and processor time, we
can determine whether performance can be improved
by re-structuring to increase parallelism, or only by simple optimization.
.PP
We tie these measurements to the logical structure of the program's procedures.
Good engineering practice demands that large programs,
whether sequential or parallel, be structured using hierarchical abstractions
[Graham et al. 1982].
We report our performance measures for each procedure and for all the work
done on its behalf, either synchronously via a normal procedure call or
asynchronously through parallelization.
The programmer can use this to walk through the hierarchy, focusing on
just those procedures that, along with their children, account for most
of the poor performance.
We expect that for parallel programs as for sequential ones,
a relatively small proportion of the code
will be responsible for most of the runtime.
.\" *** I find the following paragraph confusing at this point in
.\" *** the paper, and I don't know how to fix it, so I've commeted
.\" *** it out. (I think this is ok to leave out -- tom)
.\" .PP
.\" We also propose a new type of program performance
.\" measurement: data-oriented instead of procedure-oriented.
.\" We measure the breakdown of time spent on behalf of data objects,
.\" since parallel performance can depend as much on the data object
.\" being manipulated as on the function being computed.
.\" If a procedure is executed in parallel, performance can depend on
.\" the longest time it takes to execute on some piece of data,
.\" rather than its average time; performance can also be limited
.\" by the requirements of synchronizing accesses
.\" to particular data objects. To provide a single focus for the
.\" programmer's attention, we
.\" integrate data-oriented and function-oriented measurements into the
.\" same hierarchy defined by the program's logical structure.
.PP
These measurements can be made efficiently on a
shared memory multiprocessor by checkpointing to memory the number
of busy processors and
the state of each processor, and then statistically
sampling this information using a dedicated processor.
.PP
We have developed a tool
to test these ideas, called \fIQuartz\fR.
Quartz was built by modifying the application-level
thread package described in [Anderson et al. 1989].
Quartz uses
only the normal profiling support available on UNIX-like shared memory
multiprocessors; it currently runs on the Sequent Symmetry multiprocessor
[Sequent 1988].
.PP
The remainder of the paper discusses these ideas in more detail.
Section 2 examines existing measurement tools for tuning program performance.
Section 3 describes Quartz: its motivation, its functionality,
its applicability to a number of performance problems that
have been reported as being frequently encountered,
and its implementation.
Section 4 describes a case study in the use of Quartz
to improve the performance of a specific parallel application,
a CAD circuit verifier.
Section 5 considers the implications of our work
for the monitoring of sequential programs
and non-shared-memory multiprocessors. Section 6 summarizes our results.
.NH
Existing Tools for Tuning Program Performance
.NH 2
Tools for Sequential Programs
.PP
.nh
The philosophy underlying Quartz owes much to
the experience of UNIX gprof [Graham et al. 1982],
a tool for tuning the performance of sequential
programs running on uniprocessors.
.PP
Years of experience tuning sequential programs indicate
that the major difficulty is focus: it
is relatively easy for the programmer to improve the processing
time of a small section of code, but much effort is
commonly wasted in the wrong places \- tweaking code that has
only a small impact on overall performance.
.PP
Gprof's solution is to highlight the "hot spots" of the
program, and to do so in a way that exploits the
hierarchical structure
of large programs.
Gprof presents to the programmer the total processor time of
each procedure, including time spent on its behalf if it calls other routines.
With this information, the programmer
can tune the program in a top-down fashion,
focusing effort on those functions that have the greatest
impact on performance.
.PP
Gprof is relatively efficient. It periodically interrupts the program
to sample the program counter, thereby estimating the execution time
of each procedure. While sampling produces only an estimate,
the approach is most accurate just where it needs to be: for
those routines where the program spends most of its time.
Gprof also collects the call graph: who called whom how many times.
This is done by using compiler support that makes each procedure execute
a monitoring routine
during its prologue. Gprof then computes its central metric, the
processor time spent on a procedure's behalf, by making the
assumption that all calls to the same procedure take the same amount of time.
.\" *** this don't add much anymore
.\" (i.e., that processor time is not call-source-dependent).
Processing time is propagated bottom-up from callee to caller according to
the caller's proportion of the total calls.
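.PP
For concreteness, the sketch below shows this propagation step under
gprof's equal-cost-per-call assumption; the data layout is illustrative
and is not gprof's actual representation.
.DS L
.ft CW
/* Sketch of gprof-style time propagation under the
   equal-cost-per-call assumption; the data layout is
   illustrative, not gprof's actual structures.        */
#include <stddef.h>

struct arc {
    int  caller;       /* index of the calling procedure */
    long ncalls;       /* calls made along this arc      */
};

/* Credit a callee's time to its callers in proportion
   to each caller's share of the calls to the callee.   */
void propagate(double callee_time, const struct arc *arcs,
               size_t narcs, double *caller_time)
{
    long   total = 0;
    size_t i;

    for (i = 0; i < narcs; i++)
        total += arcs[i].ncalls;
    if (total == 0)
        return;
    for (i = 0; i < narcs; i++)
        caller_time[arcs[i].caller] +=
            callee_time * arcs[i].ncalls / total;
}
.ft
.DE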
.PP
Gprof seems so natural in retrospect that it is easy
to forget the alternative approach taken by a number of
other tools: to (expensively) measure everything that could
conceivably be of relevance to program performance,
and to report these measurements without concern for how
they relate to each other or to the structure of the program.
.PP
Our goal for Quartz was to achieve
a tool for tuning parallel program performance
that is analogous to gprof in that it
efficiently measures exactly what is important,
and relates these measurements to one another
and to the structure of the program.
This philosophy is even more important in the
parallel domain than in the sequential domain,
because of the dramatically greater number of
performance metrics and the dramatically
increased complexity of program structures.
The next two sub-sections discuss, in this
context, existing approaches to tools for
tuning parallel program performance.
.NH 2
Non-Integrated Tools for Parallel Programs
.PP
Many useful measures of parallel program performance
have been proposed.
Each provides a view of some important aspect of program behavior.
However, in many existing tools, their lack of integration with each other
and with program structure limits their usefulness.
.PP
Perhaps the simplest approach to parallel program measurement
is to extend sequential UNIX gprof to run on a multiprocessor.
In place of processor time on one processor, multiprocessor gprof
measures the sum
of the time spent on each processor [Aral & Gertner 1988].
.\" *** I didn't see that this was a major point; reinstate it if
.\" *** you disagree. (its clearer now using processing time)
.\" (Note that at any given moment, each processor may be executing
.\" a different procedure.)\
Unfortunately, as we have noted, a procedure's total processor time is not
related in a simple way to parallel runtime.
.\" Given two procedures with the
.\" same processor time, a procedure that always executes when
.\" all other processors are idle,
.\" such as an initialization routine, will have a disproportionate impact
.\" on performance compared to a routine that always runs when other
.\" processors are busy.
Something more than a straightforward adaptation of
sequential UNIX gprof clearly is necessary.
.PP
Another common tool displays the number of busy processors across time
by periodically sampling the number of runnable processes.
Assuming that all activity is due to the
program in question, this allows the programmer to see if there are periods
of time when there is too little parallelism to keep all the processors busy
[Halstead 1986].
A significant shortcoming of tools like this is that
it can be difficult to relate the periods of poor parallelism
to specific sections of code that can be changed.
Further, the fact that some processors are not doing useful work can be concealed,
since spin-waiting processors appear to be busy.
.PP
The DEC SRC Firefly has a tool that measures the time spent
waiting for each lock protecting a shared
data structure [Thacker et al. 1988].
A lock ensures that threads manipulating
the shared data structure
have mutually exclusive access to it. This serial execution
can limit performance. By measuring the wait time, the tool can
determine which critical sections
are the worst bottlenecks; these can then be re-structured to increase
parallelism. This information is useful, but long waits for a lock
will not affect performance if there is always other work to do
during the wait,
and monitoring a lock
can increase the length of time that the lock is held,
creating "artificial bottlenecks" when monitoring is
enabled.
.PP
Quartz provides many of the same metrics as these tools, but
correlates the metrics to one another and to the structure of the program.
For example, Quartz measures not only
how many processors are busy, but also which procedures execute
during periods of low and high parallelism.
.NH 2
Trace-Based Tools for Parallel Programs
.PP
The issue of determining in advance exactly what information
will be needed to tune the performance of a parallel program
can be finessed by recording
a trace of every interprocessor synchronization event
with a timestamp of when the event occurred.
The behavior
of the program can be completely reconstructed from such a trace
[Fowler et al. 1988]. \(dg
.FS
\(dg Originally, Fowler et al. implemented trace collection to aid
debugging the correctness of parallel programs; they then showed
that the same support could be used for program tuning. An implication
of our work is that the support needed for correctness debugging is not
necessary for performance tuning.
.FE
Arbitrary metrics (whether general or application-specific) can be
computed using a common
interface to the trace data.
Finally, the metrics obtained from the trace can be
integrated with each other and with the structure
of the program.
.PP
One drawback to this approach is that both
the collection and the post-processing of
trace files is expensive.
For programs that perform frequent synchronization, the
trace files can be prohibitively large. Consider
a program running on a hypothetical shared memory multiprocessor
with 20 5-MIPS processors, each of which on average performs a
monitorable event (such as acquiring or releasing a spin lock)
every 500 instructions, and where each event
record is 10 bytes. If the program being monitored executes for one
minute, the trace file will be over 100 megabytes.
Similar estimates appear in [Malony et al. 1989]
to justify hardware support for recording traces.
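.PP
The arithmetic behind this estimate is easy to check; the sketch below
simply multiplies out the assumptions stated above, and evaluates to
roughly 120 megabytes.
.DS L
.ft CW
/* Back-of-envelope trace volume for the hypothetical
   machine above; all constants are the assumptions
   stated in the text.                                  */
#include <stdio.h>

int main(void)
{
    double procs   = 20;     /* processors              */
    double mips    = 5e6;    /* instructions/sec each   */
    double period  = 500;    /* instructions per event  */
    double rec     = 10;     /* bytes per event record  */
    double seconds = 60;     /* length of the run       */

    double bytes = procs * mips / period * rec * seconds;
    printf("trace size: %.0f megabytes", bytes / 1e6);
    return 0;                /* prints 120 megabytes    */
}
.ft
.DE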
.\" *** I know I asked for it, but I don't like it!
.\" (While traces
.\" may be needed for debugging the correctness of a parallel program,
.\" those traces are likely to be much smaller than those needed for
.\" tuning performance, by using fewer processors on smaller data sets.)
.\" *** OK, this is gone too!
.\" Carpenter [1987] proposes hardware support
.\" for collecting traces on an Intel iPSC/1; the amount of physical
.\" memory in the system is \fIdoubled\fR so that trace data can be stored
.\" locally to avoid distorting the performance behavior being measured.
.\" *** Seemed somewhat redundant. I'm trying to reduce the amount
.\" *** of text, and this seems expendable. Again, restore if you
.\" *** strongly disagree. (no, it is.)
.\" .PP
.\" Collecting a complete trace postpones the problem of deciding
.\" what is important.
.\" Since a parallel program can be viewed as pieces of sequential
.\" code connected by communication events [Hoare 1978], it might seem
.\" that in order to understand parallel performance,
.\" we would need to know the pattern of communication in time
.\" This pattern of events can be graphically presented to the programmer,
.\" with processors or processes along one axis and time along the other,
.\" along with the ability to zoom in on areas that are performance
.\" bottlenecks. As programs become more complex, however,
.\" the programmer easily becomes lost, since there is no obvious connection
.\" from the visual pattern of events
.\" to overall performance or to the specific sections of code that require
.\" improvement.
.\" (The authors last year wrote
.\" just such a system, with the predictable results.)
.PP
Nevertheless, tools have been developed that collect trace
data and post-process it into useful measures. We argue later
that much of the information provided by these tools can
be measured or approximated more efficiently by sampling.
.PP
Monit [Kerola & Schwetman 1987], for example, uses a trace file to
compute the behavior of higher-level objects, such as the number
of busy processors or the number of threads waiting to enter
each critical section. The behavior of each object is then
plotted on a timeline. After identifying those phases of execution
with few busy processors, the programmer can visually correlate these
phases to the behavior of other objects (discovering, for example,
that parallelism is low while a specific critical
section has a large number of waiting threads).
.PP
Although Monit's display is at a higher level than the raw trace data,
it still can present a massive amount of data to the programmer.
Only a few timelines
will fit on a screen at a time, but Monit provides little help
in identifying those containing
information relevant to the measured lack of
parallelism.
As the complexity of the application increases, so does the number of objects
to monitor, making focusing more important.
.PP
IPS [Miller & Yang 1987] attempts both to guide the programmer to performance
problems and to provide useful statistics about those problems.
Its central focusing metric is the time each process and procedure
spends executing the critical path.
The critical path is the longest path through the task graph \- the series of
sequential pieces of code (that cannot be done in parallel because
they communicate one to the next) that takes the longest to execute.
By definition, the elapsed execution time of the program can be
reduced only by shortening the length of the critical path.
.PP
One drawback to critical path analysis is its expense: it requires
a complete trace of all interprocessor communication.
(To be fair, IPS was originally designed to measure
programs running on local area networks of uniprocessors. Because of the high
latency and low bandwidth communication on these systems,
only programs with relatively infrequent synchronization
.\" very coarse-grained parallelism
can run efficiently.
Under these conditions, the size of the trace file would be manageable.)\
Yet critical path analysis is still just a heuristic: there
.\" (as is our focusing metric): there
is no guarantee that reducing the critical path will actually
reduce the execution time of the program. There may be another
path through the task graph with almost the same length that will be
unaffected by the change.
.\" *** Seems unnecessary. (right)
.\" The existence of this other path
.\" is related to the width of the task graph (the parallelism)
.\" where the improvement is made. Improving sequential code will always
.\" improve performance; improving code that always runs when there are
.\" other busy processors may not, since those other processors are
.\" likely computing needed results.
.\" .PP
Critical path analysis also does not indicate
\fIhow\fR to reduce a procedure's completion time. One way is
to reduce its sequential execution time.
Another is to parallelize it.
But parallelization will only be of benefit if there are idle processors
to exploit when the procedure runs.
.\" When used by itself, critical path analysis has some limitations. (For
.\" this reason, IPS also computes some other measures of program behavior.)
.NH
Quartz: Its Functionality, Applicability, and Implementation
.PP
.nh
Our goals for a tool for tuning
parallel program performance are:
.IP \-
It should identify the sections of source code most
responsible for poor performance.
.IP \-
It should present
its measurements hierarchically, to allow top-down tuning
according to the logical structure of the program.
.IP \-
It should measure parallelism (properly representing spin-waiting
as a loss of parallelism) and it should tie this to the source code,
identifying where re-structuring to increase parallelism is
necessary and where code optimization is appropriate.
.IP \-
It should measure program behavior in sufficient detail to
provide some insight into the type of re-structuring that
will work.
.IP \-
It should do all this efficiently and without significantly
affecting the behavior of the measured program.
.PP
Quartz meets these criteria. Before describing it, we
must define some terms. A thread (or "lightweight process")
is a sequential execution stream; it is the basic unit of
parallel work.
A thread starts another thread by giving it a procedure to run;
the initial thread continues in parallel with the created thread.
Thread creation is thus essentially an asynchronous procedure call.
If threads are implemented as part of an application library,
they can be within an order of magnitude of the cost of a procedure
call [Anderson et al. 1989]; they can thus be used for procedure-level
parallelism.
.PP
Threads can synchronize with one another. One type of synchronization
object is a lock, used to ensure mutually exclusive
access to a shared data structure. Another is a
condition or barrier used to enforce a data dependency, as when one
thread reads data produced by some other thread.
In both cases, synchronization may cause
the thread to wait, either because the lock is busy
or because the data it requires has not been
produced. Since there may be more threads than processors,
a waiting thread has a choice: either spin
until the lock is free or the data is available, or block,
relinquishing the processor to run another thread.
Thus, there is a difference between a program's \fIeffective\fR parallelism,
the number of busy (not idle or spinning) processors, and its \fInominal\fR
parallelism, the number of runnable threads, some of which may be spinning.
Our measurements refer to the activity of just the processors executing
the application, and not any processors concurrently executing other applications.
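.PP
As an illustration of these operations, the fragment below expresses
thread creation as an asynchronous procedure call together with a
lock-protected update. POSIX threads stand in here for the
application-level package of [Anderson et al. 1989], whose actual
interface differs; the fragment is an analogue, not Quartz or the
package itself.
.DS L
.ft CW
/* POSIX threads as a stand-in for the application-level
   thread package cited above; its actual interface
   differs.                                              */
#include <pthread.h>

static pthread_mutex_t total_lock =
    PTHREAD_MUTEX_INITIALIZER;
static long total;

static void *add_partial(void *arg)   /* thread body     */
{
    long *partial = arg;
    pthread_mutex_lock(&total_lock);  /* may wait        */
    total += *partial;
    pthread_mutex_unlock(&total_lock);
    return 0;
}

/* Thread creation as an asynchronous procedure call:
   the caller continues in parallel with add_partial.    */
pthread_t start_worker(long *partial)
{
    pthread_t t;
    pthread_create(&t, 0, add_partial, partial);
    return t;                         /* joined later    */
}
.ft
.DE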
.\" As a rule of thumb, the thread should spin if
.\" the expected wait time is shorter than twice the time to do a scheduling
.\" operation, or if there is no other work to do.
.\" .PP
.\" Two key terms in the discussion that follows are \fIeffective
.\" parallelism\fR and \fInominal parallelism\fR.
.\" By effective parallelism we mean
.\" the number of busy (not idle or spinning) processors.
.\" By nominal
.\" parallelism we mean the number of runnable threads, some of which may
.\" be spin-waiting.
.\" *** NOTE: I'm not really sure that deleting this is OK.
.\" .PP
.\" Finally, we note that our discussion assumes
.\" that the application is assigned a
.\" fixed number of processors. Our measurements refer to the activity
.\" of just those processors, and not any processors concurrently executing
.\" other applications.
.\" It is a straightforward
.\" implementation issue to accommodate multiprogrammed multiprocessors
.\" with processor reassignment, but the discussion is simplified by
.\" ignoring this.
.PP
The remainder of this section is divided into three
sub-sections.
The first describes the functionality of Quartz: the specific
metrics that it reports.
The second shows how these metrics can be used to detect
.\" demonstrates the applicability of Quartz by showing how it can be used to detect
and fix a number
of performance problems that have been identified by
others as commonly occurring.
The third provides an overview of the implementation of
Quartz.
A case study in which Quartz was used to tune a specific application
is described in Section 4.
.NH 2
The Functionality of Quartz
.\" .PP
.\" In this sub-section, we outline Quartz's functionality, discuss
.\" how this functionality can be used to solve common performance problems,
.\" delaying until afterwards a description of its implementation
.\" on a shared memory multiprocessor.
.PP
The principal measurement made by Quartz is normalized processor time,
defined in Equation 1, where $P$ is the number of processors.
Measuring this weighting function for every procedure allows us to
compare them according to their effect on overall performance.
.KF
.sp 0.5
.EQ
sum from {i = 1} to {P}~{{Processor~time~with~i~processors~concurrently~busy} over {i}}
.EN
.sp 0.3
.ce 5
\fBEquation 1: Normalized Processor Time\fR
.ce 0
.sp 0.5
.KE
.\" each procedure's total processor time,
.\" subdivided according to the distribution of the effective parallelism
.\" while it is executing.
.\" Quartz computes a weighting factor for each procedure
.\" by dividing each level of effective parallelism
.\" into the corresponding processor time value, and summing
.\" these ratios.
.\" ***** Ed's old attempt
.\" To understand the rationale for this weighting factor,
.\" consider a program that initializes its data structures
.\" sequentially and then computes its result completely in parallel.
.\" As a first example, assume
.\" that the initialization and the computation take the same total
.\" processor time.
.\" Then the initialization will require a factor of $N$ greater
.\" elapsed time (where $N$ is the number of processors) and
.\" will have a much greater impact on performance than the
.\" computation.
.\" As a second example, assume that the initialization and
.\" the computation take the same total elapsed time.
.\" Then improvements in either will have equal impact on
.\" performance.
.PP
To understand the rationale for this metric,
consider a program with two functions, one that always executes
sequentially when no other processors are busy and one that computes its result
completely in parallel. If each function takes the same total
processor time, the sequential one requires a factor of $P$ greater
elapsed time (where $P$ is the number of processors) and
will have a much greater impact on program performance.
If the two functions take the same elapsed time,
then the same percentage improvements in either will have equal
impact on performance.
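.PP
A sketch of how Equation 1 can be computed from a per-procedure
histogram of processor time indexed by the number of concurrently busy
processors follows; the data layout is illustrative, not Quartz's
internal representation.
.DS L
.ft CW
/* Normalized processor time, Equation 1: sum over i of
   (processor time observed with i processors busy) / i.
   The histogram layout is illustrative only.             */
#define MAXPROCS 20

/* busy_time[i] = processor time (seconds) accumulated
   while i processors, including this one, were busy.     */
double normalized_time(const double busy_time[MAXPROCS + 1])
{
    double sum = 0.0;
    int i;

    for (i = 1; i <= MAXPROCS; i++)
        sum += busy_time[i] / i;
    return sum;
}
.ft
.DE
.LP
For the two functions in the example above, the sequential one
accumulates all of its time at $i = 1$, while the fully parallel one
accumulates its time at $i = P$, giving the factor of $P$ difference
in weight.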
.PP
Further, knowing the concurrent effective parallelism while
a routine executes is more important than knowing the effective parallelism
it generates: a serial routine that is always overlapped with other computation
will have little effect on performance compared to a serial routine that
always executes by itself. This is despite the fact that the
\fIelapsed time\fR of the two functions is the same.
.\" *** The following is correct but confusing, and probably unnecessary.
.\" A different way of viewing this weighting function is that we take the
.\" time spent by idle or spinning processors, assign it to the procedures
.\" that were executing when they were idle, and then normalize by the
.\" number of processors.
.PP
Normalized processor time reflects these observations.
Quartz also measures other program behavior; the principal performance
measurements in Quartz are summarized in Table 1.
.KF
.sp 0.5
.TS
center box;
l.
T{
.in -1.5i
.ll +1.5i
.nr LL +1.5i
.LP
.nh
For each procedure, synchronization object,
and thread, and for the work done on its behalf:
.IP
Normalized processor time.
Each term of the sum is measured separately.
.IP
Elapsed time spent in each state (busy, spinning, blocked, or ready),
along with the average and the distribution of the number of
runnable threads while it is in each state.
.sp 0.3
.in +1.5i
.ll -1.5i
.nr LL -1.5i
T}
.sp 0.5
.TE
.\" .in -1i
.\" .ll +1i
.\" .nr LL +1i
.\" .vs +1
.\" .nr VS +1
.ce 5
\fBTable 1: Principal Performance Measurements in Quartz\fR
.ce 0
.sp 0.5
.KE
.PP
To focus the programmer's attention on
those areas of the program that have the greatest
impact on performance, we sort procedures by their
normalized processor time plus that of the work done
on their behalf.
This includes work done synchronously or asynchronously
(via threads). The program's top-level procedure, then, has a normalized
processor time equal to the elapsed time of the program; the functions
it calls to do the work of the program divide that weight among them.
Quartz's ordering is analogous to what gprof does
with processor time, except that Quartz uses a weighting function
related to parallel performance. In both Quartz and gprof,
the programmer can trace performance top-down through the program.
.PP
Normalized processor time indicates \fIwhere\fR improvements can
be made.
Quartz also provides information about \fIwhat\fR can be
done to improve performance. Part of this information
comes from the measurement
of concurrent parallelism needed for normalized processor time.
This indicates whether performance can be improved by
re-structuring to increase effective parallelism,
or only by simple optimization.
Certainly, procedures that always execute when all processors are busy
will not benefit from further parallelism.
.PP
Given that re-structuring is necessary, an accounting of
the \fIelapsed time\fR spent by a procedure can help identify
what kind of re-structuring is most likely to succeed.
Quartz measures each procedure's elapsed time spent busy, spinning,
blocked, or ready to run, along with the average and the distribution of
the number of threads that are available to run while the procedure
is in each state.
For example, if a procedure is busy executing and there are
few other busy processors, then the nominal parallelism indicates
whether the other processors are idle or spinning. If idle,
then performance can be improved by parallelizing the
procedure (creating threads to do its work),
provided this is possible. If spinning, then there is
no benefit to creating more threads.
Similarly, threads that are blocked or spinning represent deferred work;
if the program were re-structured to reduce or eliminate waiting,
then parallelism would increase.
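.PP
The guidance in this discussion can be summarized as a small decision
sketch; the encoding below is ours and is not part of Quartz.
.DS L
.ft CW
/* Rough summary of the tuning guidance above, for a
   procedure observed running while few other processors
   are busy; the encoding is ours, not part of Quartz.    */
enum advice { OPTIMIZE, PARALLELIZE, REDUCE_WAITING };

enum advice suggest(int idle_procs, int spinning_procs)
{
    if (idle_procs == 0 && spinning_procs == 0)
        return OPTIMIZE;      /* all processors useful    */
    if (idle_procs > 0)
        return PARALLELIZE;   /* create threads for it    */
    return REDUCE_WAITING;    /* spinning: more threads
                                 cannot help; cut waits   */
}
.ft
.DE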
.PP
Quartz makes the above measurements separately for each procedure,
synchronization
object, and thread. Measurement based purely on procedures would ignore
the fact that parallel performance can depend more on the
data object passed to a procedure than on the implementation of the procedure
itself. It is only mildly interesting to know the total time spent
waiting at all barriers; it is much more useful to know that some specific
barrier accounted for most of that time. As a special case,
the time spent executing in a critical section is attributed to
the lock on that critical section.
(Quartz also measures queue length distributions for synchronization
objects.)\
Per-thread information allows us to determine if different threads
executing the same procedure (on different data objects) have
different performance.
.PP
By measuring synchronization objects and threads in the
same way as procedures, we can present the programmer a uniform focusing
metric. All are ranked in the same way
to simplify tracing performance through the program. For example,
if contention for a lock determines performance,
the lock object will have a high normalized processor time since its
critical section is always executing while few other processors are.
(Spin time is factored out in computing normalized processor time.)
.PP
An important difference from gprof is that we explicitly
measure the work done on behalf of a procedure or lock object.
Gprof explicitly measures only the work done within
each procedure, making the assumption that the processor time
of its children is independent of who called them. Gprof uses the
call graph (who called whom, and how many times)
to propagate processor times from callee to caller
according to the caller's percentage of the total calls to the callee.
We cannot make a similar assumption. The effective and nominal parallelism
while a procedure executes depend not only on what that procedure does,
.\" (whether or not it creates parallelism),
but also on the parallelism when it is called.
Different calls to a low-level formatting routine might have
vastly different concurrent parallelism. Still, even though it is not
useful for propagating
our measurements, Quartz does record the call graph
(including calls to/from synchronization objects) to help
the programmer in tracing performance top-down through the program.
.NH 2
Detecting Frequently Encountered Performance Problems Using Quartz
.PP
In this section, we argue by example that Quartz is useful for detecting and
fixing a number of common parallel program performance problems.
We asserted in Section 1 that
although parallel performance in general is much more complex than
sequential performance, experience suggests that
poor parallel performance typically arises from a relatively small number
of factors.
One piece of evidence for this is Table 2, which
lists the performance problems most frequently
encountered by three "vendors" of parallel computing systems
who participated in a working group concerned with
"Sources of Performance Degradation" at the \fINSF/CMU Workshop
on Performance-Efficient Parallel Programming\fR in 1986.
.\" The systems represent a range of architectural styles and
.\" parallel programming methodologies. Sequent and Harris
.\" represent shared memory multiprocessors, while Warp processors are
.\" connected in a mesh, with high-bandwidth interprocessor
.\" communication instead of shared memory.
The key observation is that
none of the problems involve subtle timing issues that might require
a complete trace of synchronization activity.
.KF
.\" .in +1i
.\" .ll -1i
.\" .nr LL -1i
.\" .vs -1
.\" .nr VS -1
.sp 0.5
.TS
center box;
l.
T{
.in -1.5i
.ll +1.5i
.nr LL +1.5i
.LP
.nh
Sequent
.RS
.IP 1.
A problem decomposition that puts most of the
work in one thread (e.g., the optimizing phase
of a concurrent compiler or a "busy" region
in a ray-tracing algorithm), so that little
real concurrency can be realized.
.IP 2.
Memory thrashing due to a poor choice of operating
system parameters.
.IP 3.
Excessive I/O that is not overlapped with computation.
.IP 4.
A synchronous software structure, such as might
arise from a very large granularity, a producer-consumer
relationship with a small number of buffers, or the
use of an unnecessarily restrictive synchronization
construct (e.g., barriers where critical sections
would suffice).
.RE
.sp 0.5
.LP
Harris
.RS
.IP 1.
Synchronization overhead.
.IP 2.
Contention for shared variables, including counting
semaphores, task queues, and the "problem heap".
.IP 3.
Starvation due to a small problem size.
.RE
.sp 0.5
.LP
Warp
.RS
.IP 1.
Excessive I/O that is not overlapped with computation.
.IP 2.
Data dependencies in loops.
.RE
.in +1.5i
.ll -1.5i
.nr LL -1.5i
T}
.sp 0.5
.TE
.\" .in -1i
.\" .ll +1i
.\" .nr LL +1i
.\" .vs +1
.\" .nr VS +1
.ce 5
\fBTable 2: Frequently Encountered Performance Problems\fR
(\fINSF/CMU Workshop on Performance-Efficient
Parallel Programming\fR)
.ce 0
.sp 0.5
.KE
.PP
The first issue in tuning any parallel program is
to identify which segments are responsible for the poor performance.
As with sequential programs, we would expect that of the large number
of functions in a parallel
program, a relative few will be responsible for
most of the program's runtime. By computing a weight based on both
processor time and parallelism, and by accounting for all of the activity
done on behalf of a function, Quartz allows the programmer to walk
through the program hierarchy to find those few functions.
Once the general area of difficulty has been located, the
approach to tuning depends on the situation:
.NH 3
Functional Decomposition
.PP
Some computations have several functionally
distinct parts, each assigned to a distinct processor.
An example of this is a pipelined compiler: separate threads
(and processors) execute
the scanner, parser, optimizer, and code generator,
streaming
results one to the next (Sequent #1 in Table 2). Performance
difficulties usually relate to load imbalance.
If one phase has more work to do than the others, the
others must sit idle waiting for it. If the optimizer is the bottleneck,
the scanner and parser will have to wait for buffer space to forward their
partial results, while the code generator must also periodically wait
for results to be completed.
.PP
Quartz would identify this problem: the thread executing
the optimizer would have a longer execution time, spend more time
executing when few other processors are busy, and thus have a larger
normalized processor time than the threads executing the other phases.
Other tools would also handle this case. For example,
a display of processor activity across time would show that the processor
executing the optimizer was always busy, while the other processors
sometimes wait.
(Of course, many tools that display processor activity fail to
relate processors to procedures.)\
Similarly, a critical path analysis would show that the
execution of the optimizer constituted most of the critical path.
.PP
It is also easy with Quartz to identify the phase that is the
secondary bottleneck \- in the compiler example, the one that would limit
performance if the optimizer were improved.
If the code generator ran for almost as long
as the optimizer, then it would have slightly less normalized processor time,
indicating that attention should be focused on improving
both phases. It is difficult to extract this information
from a timeline,
since all phases but the optimizer periodically block.
Critical path analysis only identifies the primary
bottleneck, so iteration would be required.
.PP
Another performance problem with pipelines is starvation (Harris #3).
This occurs if the problem size is small relative to the time for each phase
to start streaming results.
In this case, the later phases spend much of the
total time waiting to start; the earlier phases finish well before
the program completes.
Quartz would show that
each phase spends much of its \fIelapsed time\fR idle. (Normalized
processor time would highlight the first and last stages, since their work is
the least overlapped with other stages.) A solution is to reduce
the time before each phase starts streaming its first results.
.NH 3
Data Decomposition
.PP
Some programs compute the same function
on many pieces of data.
These programs can be parallelized by assigning
different pieces of data to different processors. Unlike functional
decomposition, each processor executes the same function at the same time.
Again, a frequent issue is load balancing: the
required computation
may vary widely for different pieces of data. An example of this
is ray-tracing
where part of the picture has the majority of the activity
(Sequent #1);
another is a fluid dynamics computation where turbulence is concentrated
in certain regions.
In such situations, performance is limited by the processor assigned
to the data regions with the longest execution times.
.PP
Whether a different thread is used to run the function on each object,
or on collections of objects, Quartz will show if the
execution times are balanced. If they are not,
one of the threads will execute while other processors are idle,
and there will be long average queue lengths at the barrier
that checks that all threads have completed before continuing.
.PP
It can be difficult to relate the
performance of a thread to the symbolic names
of the data objects it works on,
particularly in conventional (non-object-oriented)
languages. For instance, the procedure a thread is to execute can
be passed an index that only implicitly refers to the object it is to work on.
As a result, we rely on the programmer to make this connection by providing
a symbolic name when each thread is created.
In an object-oriented language such as C++ we could extend Quartz
not only to keep track of the symbolic names of data objects passed to
threads, but also to explicitly take measurements for each
procedure-data object pair, to allow
both an object-oriented and a procedure-oriented view of performance.
We intend to port Quartz to Presto [Bershad et al. 1988], a C++ based
parallel programming system, to further explore this topic.
.\" .PP
.\" Another problem that can arise with data decomposition is data dependency.
.\" Effective parallelism can be limited if the computation on one piece of
.\" data depends on the result of another. For instance, each loop iteration
.\" can use a result computed in the previous iteration (Warp #2).
.\" Initially there is lots of available parallelism, but soon each
.\" blocks, and the computation proceeds sequentially or near-sequentially.
.\" Whether a thread is used for each iteration, or a thread running on
.\" each processor grabs iterations explicitly, there will be long
.\" queue lengths at the point where the data dependency is enforced;
.\" these will be detected by Quartz.
.NH 3
Synchronization
.PP
The need to synchronize the work of different processors
can cause another class of performance problems. For instance,
execution time is increased by the overhead of parallelizing the
job: distributing work to various processors, serializing
access to shared data structures, and enforcing data dependencies
(Harris #1).
Even if the program is perfectly parallel, this time can dominate.
Fortunately, it is easy to measure.
If there is a sequential version of the program,
many of its functions will correspond to equivalent functions in the
parallel version, and the execution time of each function
can be directly compared to
determine the effect of overhead. (The
execution time added by monitoring must of course be factored out.)
Alternately, given measurements
of the performance of the thread package, the number of calls to
each thread function, such as to create a thread or to acquire a lock,
can be used to compute overhead.
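.PP
A sketch of this overhead estimate follows; the arrays of call counts
and per-operation costs are illustrative.
.DS L
.ft CW
/* Overhead estimated from call counts and measured costs
   of the thread primitives; the arrays are illustrative. */
double overhead_seconds(const long   calls[],
                        const double cost[],  /* seconds  */
                        int nprimitives)
{
    double total = 0.0;
    int i;

    for (i = 0; i < nprimitives; i++)
        total += calls[i] * cost[i];
    return total;
}
.ft
.DE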
.PP
Performance can also be affected by waiting for data dependencies
to be satisfied (Warp #2; Sequent #4) or for access to a busy
critical section (Harris #2).
Waiting threads represent
deferred parallelism; Quartz identifies this by measuring queue lengths
and the average wait time (the total elapsed time spent waiting divided
by the number of accesses to that synchronization object).
For example, if a loop data dependency limits
parallelism, there will be a long queue length
at the point where the data dependency is enforced.
Note that two of the examples of contention cited in Table 2 are
for locks within the thread package; we measure contention for these
locks in the same way that we measure locks in the application code.
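.PP
A minimal sketch of this per-object accounting follows; the structure
and the clock routine are assumptions, not Quartz's implementation.
.DS L
.ft CW
/* Per-synchronization-object wait accounting; the
   structure and clock routine are assumptions.          */
struct sync_stats {
    double wait_time;   /* total elapsed time waiting    */
    long   accesses;    /* arrivals at this object       */
};

double now_seconds(void);     /* assumed clock source    */

void record_wait(struct sync_stats *s, double wait_began)
{
    s->wait_time += now_seconds() - wait_began;
    s->accesses  += 1;
}

double average_wait(const struct sync_stats *s)
{
    if (s->accesses == 0)
        return 0.0;
    return s->wait_time / s->accesses;
}
.ft
.DE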
.PP
Even if there are many threads waiting on a synchronization object,
the question of
whether it makes sense to re-structure the program
to release that parallelism depends on whether the time
is spent spinning or blocked, and on the nominal parallelism.
When there are at least as many runnable threads as processors,
blocked threads have no
impact on performance beyond the initial context switch. Re-structuring
to increase the number of ready threads does not help in this case.
By contrast, spin-waiting always wastes processing cycles,
regardless of the number of runnable threads, but if there are excess runnable
threads
.\" and the average time spent spinning is longer than the twice the
.\" context switch time
then performance could be improved by blocking instead
of spinning.
.PP
If re-structuring is necessary, the number of threads waiting at
a lock can be decreased by any
of: reducing the number of accesses (from the call graph), thereby
reducing contention; decreasing the size of the critical section
(its busy time); distributing accesses more evenly across time
(if the queue length is sometimes zero and sometimes very long);
or modifying the protected data structure to allow parallel accesses
(for example, by giving each processor a separate copy).
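.PP
As an example of the last restructuring, a contended shared counter can
be split into per-processor copies that are combined only when the
total is needed; the processor-id routine and cache-line size below are
assumptions about the machine, not part of Quartz.
.DS L
.ft CW
/* Splitting a contended counter into per-processor
   copies; my_processor() and the line size are assumed. */
#define NPROCS     20
#define LINE_BYTES 64         /* assumed cache-line size */

struct padded_count {
    long count;
    char pad[LINE_BYTES - sizeof(long)];  /* no sharing  */
};

static struct padded_count counts[NPROCS];

int my_processor(void);       /* assumed: 0..NPROCS-1    */

void add_local(long n)        /* no lock needed          */
{
    counts[my_processor()].count += n;
}

long read_total(void)         /* combine when required   */
{
    long total = 0;
    int i;

    for (i = 0; i < NPROCS; i++)
        total += counts[i].count;
    return total;
}
.ft
.DE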
.PP
Waiting for data dependencies can be reduced by computing the
data earlier, or, if an overly restrictive synchronization construct was
used, by allowing the thread to continue temporarily without it.
Fuzzy barriers are a special case of the latter [Gupta 1989].
.NH 3
Input/Output
.PP
The time spent doing I/O was mentioned twice
in Table 2 (Sequent #2, Warp #1). If a program reads a significant amount
of data from an I/O device, then the reads should be overlapped with the
computation; in other words,
the reads should be started early so that they complete before
the data is needed.
The natural style, however, is synchronous: when the data is referenced,
start a disk read and wait until it returns.
.PP
As a result of the operating system interface on the Sequent,
the current implementation of Quartz measures time spent
doing I/O as processor time, attributed
to the procedure that performs the I/O. If the I/O
is not overlapped, the relative importance of the time spent waiting
in the kernel will be increased because
processors will be idle waiting for the I/O to finish.
Given kernel support for
threads, Quartz could monitor the kernel disk queue as a
normal synchronization object.
.PP
If a program spends a lot of time doing disk accesses, it may benefit from
exploiting parallelism in the disk sub-system. Tuning
a program's use of parallel disks is in many ways similar to tuning
its use of parallel processors, although initial file placement is an
issue as well. We expect that some of the techniques we have
described in this paper could be applied to this problem.
.NH 3
Limitations
.PP
We have designed Quartz to measure only those aspects of program
behavior that are needed to detect and fix frequently occurring
parallel performance problems. The tradeoff is that Quartz therefore
does not help with every performance problem that can occur in parallel
applications.
.PP
When threads execute at the same time, Quartz
weights each equally even though only one is on the critical path.
As an example, consider a program with a critical section that
restricts parallelism.
The processor time spent executing outside of the critical section can
appear important, because there are few other processors concurrently
executing, even though reducing or parallelizing it will have no effect on
the program's performance.
Although this can seem anomalous, Quartz's metric can help in this
case by identifying code that may be a secondary bottleneck.
We are currently investigating ways of augmenting Quartz's measurements
to address this limitation. For instance, each time a processor goes idle,
we could measure how long it stays idle, and then add that time to
the processor time of the code that causes the processor to become busy.
This metric correctly handles busy critical sections
(the time spent waiting for the lock would be attributed to the
critical section), but it does worse than normalized processor time in other
situations.
.PP
Quartz also does not measure thread scheduling decisions (although problems
can sometimes be identified, for instance, if a thread spends a long
time waiting for a processor and then executes serially)
or contention for the bus or memory, even though these can affect performance.
.\" *** Dumped!
.\" on program
.\" performance. The assignment of threads to processors is often under the
.\" control of the application, either because threads are implemented as
.\" an application-level library, or because the operating system kernel
.\" provides
.\" this support. Different assignments can result in different performance.
.\" For example, threads that frequently inter-communicate
.\" should be co-scheduled to avoid context switches [Ousterhout 1984],
.\" but Quartz does not measure whether two threads did or did not
.\" run concurrently.
.\" .PP
.\" However, the behavior of the scheduler, and its effect on performance,
.\" depends on thread execution times and the structure of communication
.\" between threads. These both change as the program is tuned.
.\" The schedule also depends on the number of processors given to
.\" the application. Thus, tuning the thread schedule is likely to
.\" be done last, if at all.
.\" .PP
.\" Similarly, bus contention (or memory contention in systems with
.\" multistage interconnection networks) affects parallel performance on
.\" shared memory multiprocessors. Again, this issue is likely
.\" to become important only once the program is already fairly well
.\" parallelized,
.\" since only then would many processors be actively accessing the bus.
.\" Measuring the program to help tune this is a problem, however.
.\" It is relatively simple to build a hardware monitor to measure
.\" total bus utilization, but this does not indicate which sections
.\" of code should be changed to reduce the utilization. Rather,
.\" if this becomes important, as we think it will eventually,
.\" we would need a tool that collects a memory-reference trace
.\" (at least of shared variables), processes the trace to determine
.\" which references would result in bus traffic,
.\" and ties the bus traffic back to particular data structures.
.\" While cache misses that would occur on a uniprocessor probably
.\" cannot be optimized, bus traffic that communicates results
.\" between processors perhaps can be.
.\" Weber and Gupta [1989], for instance, have measured several different
.\" shared variable access patterns that have widely different performance
.\" under invalidation-based cache coherency.
.NH 2
The Implementation of Quartz
.PP
We have implemented Quartz on a Sequent Symmetry shared memory
multiprocessor [Sequent 1988]. The Sequent runs DYNIX, a multiprocessor
adaptation of UNIX. Since DYNIX processes are too expensive to use
directly as threads, we built our system by adding monitoring code to
the thread package described in [Anderson et al. 1989].
That thread package works by creating a DYNIX process for each processor,
and then multiplexing threads onto the DYNIX processes.
Our implementation did not modify DYNIX or the C compiler;
it used only the support they provide for gprof.
.PP
Our implementation addresses the twin concerns of
efficiency and accuracy. Because program tuning is
iterative and interactive, a tool's usability depends on the
elapsed time from program compilation
to report production.
Accuracy is trickier. Unlike sequential
programs where the execution overhead due to monitoring is easily
factored out, a change to a parallel program can alter its behavior
in subtle ways. For instance, monitoring code that increases the
time that a lock is held may increase the contention for the lock.
Analogously, instrumentation added outside of a critical section
will cause a net decrease in the contention for that critical
section.
.PP
Our approach is to use statistical sampling by a dedicated processor.
A set of processors executes the program normally, maintaining
their state in shared memory by special code
executed during thread operations
and at procedure entry and exit. This state is then
sampled by a dedicated processor that does not participate in
executing the program.
We impose no synchronization beyond hardware interlocks between
the sampling processor and the other processors; rarely-accessed
locks are used by the normally executing
processors in building the call graph.
.PP
We sample by means of a dedicated processor
rather than interrupts because interrupts
cannot provide accurate correlations between processor
state and overall program state.
On the Sequent, as with most multiprocessors, interrupts are fielded
by each processor asynchronously; by the time the program state is
sampled, it may have changed in a way affected by the fact that
there was an interrupt. For example, if the interrupted processor
is holding a lock, the queue length at the lock will be greater
than a purely random sample would indicate.
Similarly, measuring a procedure's execution time directly with timestamps
does not allow us to correlate that time to the number of busy processors.
Of course, although it reduces the effect of monitoring on the measured
program, sampling by a dedicated processor does not eliminate distortion.
Updating state adds time to the computation, and even the
recording of samples by the dedicated processor can increase bus and
cache coherence traffic, thereby slowing other processors.
.\" *** The implementation section is too long. Rather than crafting
.\" *** it, I'm using a hatchet.
.\" .PP
.\" Another approach would be to generate a timestamped trace
.\" of all procedure calls and interprocessor synchronization events,
.\" and, to avoid having to save the trace to a file, concurrently process
.\" it to compute the useful metrics. Such a trace would contain all of
.\" the information needed for the measurements we describe above.
.\" A similar approach is used by Parasight to collect normal gprof data
.\" [Aral & Gertner 1988].
.\" Unfortunately, this approach imposes severe real-time constraints.
.\" It is difficult to make even an estimate of the rate at which events
.\" will be generated; thus, it would be hard to know how many
.\" extra processors would be needed to prevent buffers of trace data
.\" from overflowing. Further, the more processors that must be
.\" dedicated to making measurements, the less applicable are those
.\" measurements to the program's performance without monitoring,
.\" when it is given all of the system's processors.
.\" .PP
.\" Instead, in our implementation, the processors executing the program
.\" maintain state in shared memory that is statistically sampled from
.\" a dedicated processor. If the sampling rate achieved by a single
.\" processor is too low, it can
.\" be increased linearly by adding processors.
.\" In a multiprogrammed environment, the dedicated processor must stop
.\" sampling a processor when it is pre-empted from the application.
.\" This can be done either implicitly, if all processors, including
.\" the dedicated processor, are pre-empted together, or explicitly
.\" by being told by the operating system when a pre-emption occurs.
.\" (Tucker and Gupta [1989] have shown severe performance degradation
.\" can occur if applications are not informed when some of their processors
.\" are pre-empted.)
.PP
The nominal and effective parallelism are
maintained in centralized counters, updated when
a thread is added to the ready queue and when a processor
becomes idle or starts or stops spinning.
The counters are maintained with atomic increment and decrement instructions,
to avoid making access to them a bottleneck. Most multiprocessors,
including the Sequent, support such instructions.
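.PP
For concreteness, the counter updates amount to roughly the following
minimal sketch (in C); the routine names and the atomic_add() primitive
are illustrative assumptions, not the actual Quartz code.
.DS
.ft C
/* assumed to map to the hardware's interlocked add instruction */
extern void atomic_add(volatile int *addr, int delta);

volatile int nominal_parallelism;   /* threads running or ready       */
volatile int effective_parallelism; /* processors neither idle nor    */
                                    /* spinning                       */

void thread_enqueued(void)   { atomic_add(&nominal_parallelism,  1); }
void thread_finished(void)   { atomic_add(&nominal_parallelism, -1); }

void processor_resumes(void) { atomic_add(&effective_parallelism,  1); }
void processor_idles_or_spins(void)
                             { atomic_add(&effective_parallelism, -1); }
.ft P
.DE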
.PP
In addition to the execution stack, we maintain a profile stack
of monitored procedures for each thread.
This allows us to record both the time spent in a procedure
and the time spent on behalf of the procedure.
The dedicated processor copies the number of
busy and ready threads, copies the profile stack,
and then bumps the appropriate
measurement record for each different procedure on the stack.
(Recursive procedures are counted only once.)
The stack may change between the recording of the number of busy
threads and the copying of the stack, reducing consistency, but
sampling is itself only an approximation.
In particular, we do not lock the profile stack to prevent changes
from occurring;
this would unnecessarily perturb the execution of the program.
Note that locking would be harder to avoid if we were to sample directly
from the execution stack since that would require tracing the chain of
frame pointers.
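.PP
The sampling loop run by the dedicated processor is then conceptually
simple. The sketch below is only illustrative (the structure and array
names are ours, not the Quartz internals), but it shows the order of
checkpointing and copying, and the treatment of recursion, described above.
.DS
.ft C
#define MAXDEPTH  64           /* fixed size, set at compile time     */
#define MAXCONC   32           /* more than the number of processors  */

typedef struct profile_stack {
    int depth;
    int overflowed;            /* overflow caught, reported later     */
    int entry[MAXDEPTH];       /* procedure and object IDs            */
} profile_stack_t;

extern volatile int effective_parallelism;  /* busy processors        */
extern profile_stack_t *pstack_of[];   /* profile stack of the thread */
                                       /* running on each processor   */
extern double sample_time[][MAXCONC];  /* [procedure ID][busy procs]  */

void sample_once(int nprocessors, double interval)
{
    int p, i, j, busy;

    busy = effective_parallelism;      /* checkpoint the global state */
    for (p = 0; p < nprocessors; p++) {
        profile_stack_t s = *pstack_of[p];  /* unsynchronized copy    */
        for (i = 0; i < s.depth; i++) {
            /* count each different procedure once, so recursion is
               not double-counted */
            for (j = 0; j < i; j++)
                if (s.entry[j] == s.entry[i])
                    break;
            if (j == i)
                sample_time[s.entry[i]][busy] += interval;
        }
    }
}
.ft P
.DE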
.PP
The profile stacks are of fixed size, established at compile time.
Overflows are caught, prevented, and later reported to the programmer.
We expect that overflows will occur only rarely, since we
push a procedure onto the stack only if the previous entry
is different, eliminating immediately recursive calls,
the most common cause of arbitrarily deep stacks.
This also has the effect of reducing the work of
the sampling processor.
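.PP
Continuing the sketch above, the push and pop operations might look
like the following; push reports whether an entry was actually made,
so that a push suppressed by immediate recursion or by overflow is
matched by a suppressed pop.
.DS
.ft C
/* profile_stack_t and MAXDEPTH as in the previous sketch */

int profile_push(profile_stack_t *s, int id)
{
    if (s->depth > 0 && s->entry[s->depth - 1] == id)
        return 0;             /* immediately recursive call: skipped  */
    if (s->depth == MAXDEPTH) {
        s->overflowed = 1;    /* caught; reported to the programmer   */
        return 0;
    }
    s->entry[s->depth++] = id;
    return 1;
}

void profile_pop(profile_stack_t *s, int pushed)
{
    if (pushed)
        s->depth--;
}
.ft P
.DE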
.PP
We use only the normal compiler support provided for gprof.
A monitoring routine is called in the prologue of each profiled procedure.
Exactly as gprof does, we use this routine to update the count of calls
to the procedure from its caller; we also push the procedure onto
the profile stack. Because the compiler inserts only a prologue call, we
manipulate the execution stack so that when the procedure returns, it returns
first to our code that pops the profile stack, and then to the
caller procedure. This is a bit inefficient, but easy to implement.
.PP
To simplify mapping from the entries on the profile stack
to the measurement
data for each procedure, we assign each procedure a unique ID.
The gprof monitoring routine is passed a pointer to a procedure-specific
location; this was originally used to count the number of calls to
the procedure. After the program has been linked, we modify the
object file so that each
procedure's location holds a unique ID; this ID is what is pushed
onto the stack and used to index the procedure's data record.
Gprof, by contrast, uses the address in the program counter to index the
data record for a procedure; this requires space proportional to
the size of the code segment. By using procedure IDs, Quartz requires
space proportional to the number of procedures times the number of processors.
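.PP
The space argument can be made concrete with a small sketch; the
constants and field names below are illustrative, not the actual
Quartz data structures.
.DS
.ft C
#define NPROCESSORS  20        /* illustrative                        */
#define NPROCEDURES  512       /* one ID per profiled procedure       */
#define MAXCONC      32

struct proc_record {
    long   ncalls;             /* per-caller call counts elided       */
    double time[MAXCONC];      /* samples at each concurrency level   */
};

/* space proportional to procedures times processors, independent of */
/* the size of the code segment                                      */
struct proc_record records[NPROCESSORS][NPROCEDURES];

/* the prologue hook is passed the address of the procedure-specific */
/* word, which after linking holds the procedure's unique ID         */
void count_call(int *id_word, int cpu)
{
    /* the same ID is what gets pushed onto the profile stack */
    records[cpu][*id_word].ncalls++;
}
.ft P
.DE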
.PP
The synchronization routines in our thread library are specially modified.
Each object has a data record containing the call graph and execution
time information. When the object is accessed, the call graph is updated
and a pointer to the synchronization object is pushed; the
pointer is popped when the thread no longer must wait.
Locks are handled as a special case.
Normally, the procedure that acquires a lock is also the one that
releases it, in which case it is safe to push the lock before it is
acquired and to pop it after it is released. This attributes the time
spent waiting for and holding the lock
to the lock object, and only adds two instructions to the inside
of the critical section: setting the
state of the thread to no longer spinning,
and incrementing the number of busy processors.
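.PP
For this common case, the instrumented lock path looks roughly like
the sketch below. The helper names (profile_push and profile_pop as
sketched earlier, an atomic add, a test-and-set, and a routine that
records the processor's state in shared memory) are assumptions,
not the actual thread package interface.
.DS
.ft C
struct mon_lock {
    volatile int held;              /* acquired with test-and-set     */
    int          id;                /* indexes the lock's data record */
};

struct profile_stack;                         /* as sketched earlier  */
extern int  profile_push(struct profile_stack *, int id);
extern void profile_pop(struct profile_stack *, int pushed);
extern struct profile_stack *my_profile_stack(void);
extern void set_my_state(int state);  /* recorded in shared memory    */
extern void atomic_add(volatile int *addr, int delta);
extern int  test_and_set(volatile int *flag);
extern volatile int effective_parallelism;
#define BUSY     0
#define SPINNING 1

void mon_lock_acquire(struct mon_lock *l)
{
    /* push before waiting: waiting and holding time are both
       charged to the lock object */
    profile_push(my_profile_stack(), l->id);
    set_my_state(SPINNING);
    atomic_add(&effective_parallelism, -1);
    while (test_and_set(&l->held))
        ;                              /* spin                        */
    /* the only two added operations inside the critical section:     */
    set_my_state(BUSY);                /* no longer spinning          */
    atomic_add(&effective_parallelism, 1);  /* one more busy processor */
}

void mon_lock_release(struct mon_lock *l)
{
    l->held = 0;                           /* release, then ...       */
    profile_pop(my_profile_stack(), 1);    /* ... pop the lock object */
}
.ft P
.DE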
.PP
When a thread is created, we copy the profile stack from the creating thread
to the new thread. This allows the sampling processor to attribute
execution time across asynchronous procedure calls.
.\" A pointer to the thread-specific data record is also pushed.
.PP
Quartz has two other useful features.
First, more than one processor can be dedicated to sampling, improving
measurement accuracy and resolution. Second, Quartz automatically removes
from its measurements most of the time spent by the normally executing
processors in updating their monitored state.
.PP
Our system does not currently provide for interactive control of
which routines are to be profiled.
This would be easy to add,
.\" *** Again, I'm hatcheting what I consider to be detail. I admit
.\" *** that at this point I'm not using much subtlety.
.\" A bit could be associated with
.\" each procedure ID, synchronization
.\" object, or thread to turn profiling on or off
.\" during execution. Note that by using a separate stack, we correctly
.\" attribute execution time spent in procedures that are not profiled
.\" to the procedures that called them that were profiled. Because gprof
.\" uses the call graph to propagate execution time, and no call graph data
.\" is recorded for non-profiled procedures, it cannot do this.
.\" .PP
but in truth, we doubt that it is the right approach.
Aral and Gertner [1988] argue that
gprof's overhead is too high to allow only compile-time control.
They use this to motivate Parasight, a system for execution-time
code modification and re-linking. But the
overhead of gprof, and of our system, could be dramatically reduced
with a small amount of compiler support.
For example, most of the time in gprof
is spent building the call graph; it crawls up the execution stack to find
the caller address, hashes on it, checks the callee address, etc.
A simpler method is to determine caller-callee pairs at compile time
and to bump a statically allocated counter before each call.
Calls made via function pointers, a rarer case, could
use the current, slower approach.
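.PP
As a hypothetical illustration, the suggested compiler support might
emit something like the following for a statically known call site;
the names are ours.
.DS
.ft C
extern int compress(int x);        /* some statically known callee    */

static long arc_main_to_compress;  /* one counter per caller-callee   */
                                   /* pair, allocated by the compiler */

int main(void)
{
    int y;

    arc_main_to_compress++;        /* one increment; no stack crawl   */
                                   /* or hashing at run time          */
    y = compress(42);
    return y;
}
.ft P
.DE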
.NH
A Case Study: Using Quartz to Tune a CAD Circuit Verifier
.PP
We argued "abstractly" in Section 3.2 that Quartz
is well-suited to detecting and fixing
a spectrum of parallel program performance
problems that have been identified by others as commonly occurring.
Of course, the crucial question is whether Quartz
is an effective tool in practice.
In this section, we describe our experience in using Quartz
to tune an existing parallel application.
.PP
The application we tuned, called Verify [Ma et al. 1987], compares
two different circuit implementations to determine whether they are
functionally (Boolean) equivalent. It was written for a
dissertation to demonstrate that an existing production
CAD program could be parallelized with good speedup.
The program has 2900 lines of C code, and was written for
a Sequent Balance with twelve processors.
The circuits we used as inputs in our tests
were combinational benchmarks for evaluating test generation algorithms.
.PP
The initial speedup of Verify
on our Sequent Symmetry was already good: 9.2 using 18 processors.
(Because no sequential version of the program was available,
speedup was measured as the time to run the program, including
process creation and I/O, on one processor divided by the
time to run it on 18 processors.)
Even though neither of us was familiar with the program or with
CAD algorithms in general, over the course of several hours
we were able to improve its performance by 40%.
Its initial runtime was 114.4 seconds on one processor and 12.4 seconds on
18; with our changes, the runtime dropped to 7.7 seconds on 18 processors.
Most of this improvement came within the first few minutes of using
Quartz, demonstrating the utility of using normalized processor time
as a weighting function.
.PP
Figure 1 shows a portion of the Quartz output when run on the
initial version of Verify. After the data has been collected,
Quartz can interactively draw a graph (on an X Window display)
of the total normalized
processor time of any monitored procedure or synchronization object
as a function of the number of concurrently busy processors.
Normalized processor time is based on each routine's
processor time divided by its concurrent parallelism,
whether the parallelism is due to that routine or to other
activity in the program. The top-level routine "main", however,
is responsible for all of the activity in the program;
its graph of normalized processor time is equivalent to the
elapsed time the program spent with each number
of busy processors.
Different shadings highlight the role of the routine itself,
and its children, in the routine's total normalized processor time.
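.PP
One way the reported totals could be derived from the samples is
sketched below (the array and parameter names are illustrative):
the time observed at each level of concurrency is weighted by the
reciprocal of the number of busy processors. For "main", the sum
reduces to the program's elapsed time, as noted above.
.DS
.ft C
/* time_at[c] is the sampled processor time observed while c        */
/* processors were busy; returns total normalized processor time    */
double normalized_time(double time_at[], int nprocessors)
{
    int    busy;
    double total = 0.0;

    for (busy = 1; busy <= nprocessors; busy++)
        total += time_at[busy] / busy;  /* divide by concurrency     */
    return total;
}
.ft P
.DE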
.KF
.sp 0.5
.ft C
.ce 2
Procedure: main
.br
Normalized processor time: 16.36 sec. (100%)
.sp 2.5i
.br
.TS
center;
l c s c
l c s c
l n n n.
Name Normalized Calls
processor time
_
.sp 0.29
work 10.43 (64%) 18
.sp 0.29
create_cone 3.62 (22%) 2
.sp 0.29
input 1.93 (12%) 2
.sp 0.29
main self .31 ( 2%)
.TE
.sp 0.5
.ft P
.ce 5
\fBFigure 1: Initial Quartz Output for Verify "main"\fR
.ce 0
.sp 0.5
.KE
.PP
The Quartz output for Verify shows that although most of the program
is indeed highly parallel, a significant portion of its runtime
is spent executing serial code. Further, Quartz identifies
"create_cone" as being responsible for most of this serial execution time.
By examining the graph for "create_cone", and in turn the graph for
its child with the largest normalized processor time, we found that
the program was spending a quarter of its time executing print routines
in order to log a trace of shared data structures as they were created.
.PP
While Quartz easily identified this performance problem, gprof would not
have. The logging routines accounted for less than 2% of
the program's sequential processor time, but because they occurred during the
serial initialization phase of the program, they accounted for
a much larger share of the parallel runtime.
Before Quartz pointed it out, we did not know that the program
was even doing logging.
.PP
Quartz allowed us to make an informed choice about a performance
tradeoff: we could substantially improve performance by removing logging,
or if this functionality was central to the program, at least we would
know by how much it reduced performance. Hypothesizing that it was not
important, we made logging a command line option. With logging turned
off, the program's parallel performance improved by close to the
amount predicted by Quartz, while its serial performance improved only
slightly. As a result, the program's speedup improved from 9.2 to 12.2.
.PP
When we re-profiled the modified program, Quartz showed that
a significant fraction of the program's runtime was still spent in
the sequential initialization phase.
To reduce this, we read in and allocated data structures for the two input
circuits in parallel with each other and with the operating system
process creation needed to start the program running on all 18 processors.
This improved performance somewhat to a speedup of 12.9.
.PP
At this point, we stopped trying to further parallelize the program.
Once a program's speedup is high, further improvements
become much more difficult. The routines responsible for the difference
from ideal speedup account for only a small fraction of the total
program runtime; thus even radical improvements to these routines
can reduce overall runtime only slightly.
.PP
In our case, Quartz showed that virtually all of the program's runtime was
now being spent executing entirely in parallel. The remaining
time was split between processor starvation during initialization and
termination. During initialization, Quartz showed that performance
was limited by the fact that the input files were not balanced (one circuit
was larger and therefore took longer to read in than the other).
During termination, some processors finished early and had to
wait for the rest; dividing the work into smaller sub-problems
might reduce this imbalance. Fixing either of these problems seemed
to be more effort than it was worth.
.PP
Instead, we noted that small changes in the routines that
account for most of the parallel execution time
would have a large relative effect on runtime. According to Quartz, two
routines accounted for over half the execution time of the modified program,
and we were able to improve performance somewhat with
a few quick tweaks to these routines. (These routines were already
highly tuned from the original sequential program.)
.PP
Table 3 summarizes the changes that we made to Verify and their
effects on sequential and parallel performance. Note that our
last improvement in fact decreased the program's speedup:
by reducing the execution time of the parallel portion of the
program, we increased the relative importance of its
sequential component.
.KF
.sp 0.5
.TS
center box;
c c s c
c c c c
l n n n.
Version Elapsed Time (sec.) Speedup
Serial Parallel
_
Original 114.4 12.4 9.2
Without logging 109.7 9.0 12.2
Parallel input 109.7 8.5 12.9
Serial optimization 92.3 7.6 12.2
.TE
.sp 0.5
.ce 5
\fBTable 3: Performance Effect of Changes to Verify\fR
.ce 0
.sp 0.5
.KE
.PP
Because a major motivation behind Quartz is efficiency, we measured
the overhead that it added to this application.
Quartz increased the elapsed runtime of Verify by roughly
the same amount as gprof: about 70%. (While Quartz is able
to remove most of this overhead from its measurements, some overhead
does appear in the graph in Figure 1.) Quartz is as fast as gprof
because, even though it does more work on each procedure call
and synchronization event, it need not periodically interrupt
(and thereby slow) the execution of the program; a dedicated
processor does the sampling instead.
Verify makes stringent demands on Quartz: it
generates roughly 9 million procedural and synchronization events
(roughly 1 million per second when running on 18 processors).
Even with 18 processors running, a single dedicated
sampling processor was able to sample each processor's
activity every 6 milliseconds, faster than gprof's sampling rate on DYNIX.
.PP
Using Quartz on a real application demonstrated something that we
did not expect: there is less "performance locality" in parallel programs
than in sequential ones. The top eight procedures account for 95%
of the elapsed time of Verify on one processor.
With 18 processors, though, it
takes over 20 procedures to reach the same 95% level.
In retrospect, the reason is obvious. The routines that account
for most of the time on one processor are parallelized and therefore
account for much less of the program's runtime on multiple processors;
at the same time, routines that take only a small amount of processor time
can become important if they run sequentially.
.NH
Implications for Other Systems
.PP
While we have implemented Quartz on a shared memory multiprocessor,
our work has implications for other systems.
.PP
On multiprocessors with distributed memory, such as the Intel Hypercube,
a dedicated sampling processor would not have efficient access
to the state of other processors. Explicit messages would have to be used
to update the counts of effective and nominal parallelism, as well
as the procedures each processor was executing.
A further problem is that programs on these systems are often
written to use a specific number of processors, because the
communication pattern must be explicitly controlled; removing a
processor for sampling might require re-writing the program.
.PP
Alternative approaches also have significant drawbacks on
such systems.
In particular,
recording and post-processing a complete trace may already be
impractical for some programs,
and will become more so as distributed memory multiprocessors support
faster rates of interprocessor communication.
.PP
Efficient sampling could be implemented given
hardware support for stopping all processors at close to the same time
(e.g., by allowing a host computer to send parallel interrupts to each processor).
One of the reasons for using a dedicated processor is that
interrupting any single processor to do sampling can distort
the behavior being measured. If all were stopped together,
the sample could be taken from that snapshot
without measurement error. The sampling could be implemented
efficiently by using the processing power of the stopped processors.
.PP
Absent hardware support, it may be possible to exploit the
characteristics of parallel programs on distributed memory
multiprocessors. Because of the requirement that interprocessor
communication be explicitly programmed using messages,
these systems are most commonly used for highly data-parallel
applications with regular communication patterns. For these types
of programs, at any point during the computation, each processor
executes roughly the same section of code, although one may finish
before another. As a result, sampling the behavior of each individual
processor, and not the global state, may yield a sufficiently detailed
picture of program performance.
.PP
The techniques used in
Quartz also solve some problems with traditional approaches
to tuning sequential program performance.
One limitation of gprof is that it cannot be used to tune
the implementation of operating system kernel-level routines,
since interrupt-driven sampling cannot measure code that runs
with interrupts disabled. By using a separate processor to do sampling,
however, we would be able to accurately measure kernel-level execution.
.\" The kernel routines would update the profile stack.
.\" If the stack is made readable to application-level programs,
.\" the sampling routines need not even run in the kernel.
Note that because we avoid synchronization among the executing processors,
and between them and the sampling processor, a processor in the kernel
can execute the profiling code even while it holds the low-level scheduler lock.
.PP
Similarly, by using a profile stack for sampling, we are able
to account correctly for execution time spent outside of the monitored
address space.
Because of the way gprof propagates execution time spent on behalf of
a procedure, it cannot accurately attribute time spent executing in
non-profiled code, or in the operating system on behalf of the program.
However, a trend in the design of operating systems and large applications
is to decompose them into separate modules in different address
spaces, so that hardware protection mechanisms can be used
for failure isolation. Much of the execution time of an editor,
for instance, might be spent in another address space responsible for
updating the display.
The performance of the entire decomposed system could be measured
by sampling profile stacks shared among all address spaces.
.\" sampling could focus on a single address space by attributing time
.\" spent in other address spaces to the procedure that invoked that activity.
.PP
Data-oriented measurement can be helpful for tuning sequential programs
as well as parallel programs.
.\" A trend in software engineering
.\" is toward object-oriented programming.
For some programs,
knowing that a particular procedure accounts for a large
proportion of the processing time may not be as useful
as knowing that a particular object
is expensive. For example, a more finely tiled sphere gives better
graphics resolution, but it also costs more to draw.
Gprof introduces a systematic bias in measuring
object-oriented program behavior because
it assumes that a procedure call
always takes the same amount of time to execute (i.e., that the time
does not depend on the data that is passed).
Quartz avoids this bias by propagating execution times explicitly.
While we currently make only limited measurements of data-oriented
performance behavior,
our design is extensible to a more thorough object-oriented implementation.
.NH
Conclusions
.PP
Achieving good performance from parallel applications
is both crucial and challenging.
We have discussed the design rationale,
functionality, implementation, and use of Quartz,
a tool for tuning parallel program performance
on shared memory multiprocessors.
.PP
The philosophy underlying our work is that
an effective tool for tuning parallel program performance
must be based
on a clear view of the causes of poor performance,
and on a specific methodology for improving that performance.
By being selective about what it measures and presents, Quartz can
focus the programmer's attention on the information needed to tune
performance. Measurement efficiency also results from designing the tool to
record just the important behavior.
.PP
By relating the execution
of sections of code and the use of certain data objects with the
concurrent behavior of other processors, Quartz
assists in identifying areas of the program where re-structuring
is necessary to improve performance, and in gaining insight
into the types of re-structuring that will work. Because Quartz organizes
performance information according to the logical structure of the program,
the programmer can tune performance in a top-down fashion.
.SH
Acknowledgments
.PP
We would like to thank Hi-Keung Tony Ma for donating the application
program we used for our case study, and Brian Bershad, Henry Levy,
and John Zahorjan for their extensive comments.
.SH
References
.\" .nr PS 8
.\" .ps 8
.\" .nr VS 9
.\" .vs 9
.nh
.LP
[Anderson et al. 1989]
.RS
Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy.
The Performance Implications of Thread Management Alternatives
for Shared Memory Multiprocessors.
\fIIEEE Transactions on Computers 38\fR,12 (December 1989), pp. 1631-1644.
.\" \fI1989 ACM SIGMETRICS and Performance '89
.\" Conference on Measurement and Modeling of Computer Systems\fR,
.\" pp. 49-60, May 1989. To appear,
.RE
.LP
[Aral & Gertner 1988]
.RS
Ziya Aral and Ilya Gertner.
Non-Intrusive and Interactive Profiling in Parasight.
\fIProc. ACM/SIGPLAN PPEALS 1988\fR, pp. 21-30.
.RE
.LP
[BBN 1985]
.RS
BBN Laboratories. Butterfly Parallel Processor Overview. 1985.
.RE
.\" .LP
.\" [Bershad 1988]
.\" .RS
.\" Brian Bershad.
.\" Thrips: A Performance Debugging Tool for Parallel Programs.
.\" Department of Computer Science, University of Washington, 1988.
.\" .RE
.LP
[Bershad et al. 1988]
.RS
Brian Bershad, Edward Lazowska, and Henry Levy.
Presto: A System for Object-Oriented Parallel Programming.
\fISoftware: Practice
and Experience 18\fR,8 (Aug. 1988), pp. 713-732.
.RE
.LP
[Burkhart & Millen 1989]
.RS
H. Burkhart and R. Millen.
Performance Measurement Tools in a Multiprocessor Environment.
\fIIEEE Transactions on Computers 38\fR,5 (May 1989), pp. 725-737.
.RE
.LP
[Carpenter 1987]
.RS
R.J. Carpenter. Performance Measurement Instrumentation for Multiprocessor
Systems. In \fIHigh Performance Computer Systems\fR, ed. E. Gelenbe,
North-Holland, pp. 81-92, 1987.
.RE
.\" .LP
.\" [Couch 1988]
.\" .RS
.\" A.L. Couch.
.\" Graphical Representations of Program Performance on Hypercube
.\" Message-Passing Multiprocessors.
.\" Ph.D. Thesis, Department of Computer Science, Tufts University, April 1988.
.\" .RE
.\" .LP
.\" [Davis & Hennessy, 1988]
.\" .RS
.\" Helen Davis and John Hennessy.
.\" Characterizing the Synchronization Behavior of Parallel Programs.
.\" \fIProc. ACM/SIGPLAN PPEALS 1988\fR, pp. 198-211.
.\" .RE
.\" .LP
.\" [Dritz & Boyle 1987]
.\" .RS
.\" Kenneth W. Dritz and James M. Boyle.
.\" Beyond "Speedup": Performance Analysis
.\" of Parallel Programs.
.\" Technical Report ANL-87-7, Mathematics and
.\" Computer Science Division, Argonne National Laboratory,
.\" Feb. 1987.
.\" .RE
.LP
[Fowler et al. 1988]
.RS
Robert J. Fowler, Thomas J. LeBlanc, and
John M. Mellor-Crummey.
An Integrated Approach to Parallel Program
Debugging and Performance Analysis on
Large-Scale Multiprocessors.
\fIProc. ACM SIGPLAN/SIGOPS Workshop on
Parallel and Distributed Debugging\fR,
May 1988.
.RE
.LP
[Graham et al. 1982]
.RS
S.L. Graham, P.B. Kessler, and M.K. McKusick.
Gprof: A Call Graph Execution Profiler.
\fIProc. ACM SIGPLAN Symposium on Compiler
Construction\fR, June 1982.
.RE
.\" .LP
.\" [Gregoretti & Segall 1985]
.\" .RS
.\" F. Gregoretti and Z. Segall.
.\" Programming for Observability Support in a Parallel Programming Environment.
.\" Department of Computer Science, Carnegie Mellon University, 1985.
.\" .RE
.LP
[Gupta 1989]
.RS
R. Gupta.
The Fuzzy Barrier: A Mechanism for High Speed Synchronization of Processors.
\fIProc. 3rd International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS-III)\fR,
pp. 54-63, April 1989.
.RE
.\" .LP
.\" [Hoare 1978]
.\" .RS
.\" C.A.R. Hoare.
.\" Communicating Sequential Processes.
.\" \fICommunications of the ACM 21\fR,8 (Aug. 1978), pp. 666-677.
.\" .RE
.LP
[Halstead 1986]
.RS
R. Halstead, Jr.
An Assessment of Multilisp: Lessons from Experience.
\fIInternational Journal of Parallel Programming 15\fR,6
(Dec. 1986).
.RE
.\" .LP
.\" [Joyce et al. 1987]
.\" .RS
.\" J. Joyce, G. Lomow, K. Slind, B. Unger.
.\" Monitoring Distributed Systems.
.\" \fIACM Transactions on Computer Systems\fR, vol. 5, no. 2, May 1987,
.\" pp. 121-150.
.\" .RE
.LP
[Kerola & Schwetman 1987]
.RS
Teemu Kerola and Herb Schwetman.
Monit: A Performance Monitoring Tool for Parallel
and Pseudo-Parallel Programs.
\fIProc. 1987 ACM SIGMETRICS Conference\fR,
May 1987.
.RE
.LP
[Ma et al. 1987]
.RS
H.T. Ma, S. Devadas, R. Wei, and A. Sangiovanni-Vincentelli.
Logic Verification Algorithms and Their Parallel Implementation.
\fIProc. 24th Design Automation Conference\fR, July 1987, pp. 283-290.
.RE
.LP
[Malony et al. 1989]
.RS
Allen Malony, Daniel Reed, James Arendt, Ruth Aydt, Dominique Grabas,
and Brian Totty.
An Integrated Performance Data Collection, Analysis, and Visualization
System. \fIProc. 4th Conference on Hypercubes, Concurrent
Computers, and Applications\fR, 1989.
.RE
.LP
[Miller & Yang 1987]
.RS
Barton P. Miller and C.-Q. Yang.
IPS: An Interactive and Automatic Performance
Measurement Tool for Parallel and Distributed
Programs.
\fIProc. 7th International Conference on
Distributed Computing Systems\fR, September 1987.
.RE
.\" .LP
.\" [Miller et al. 1989]
.\" .RS
.\" B. Miller, M. Clark, S. Kierstead, S. Lim.
.\" IPS-2: The Second Generation of a Parallel Program Measurement System.
.\" Department of Computer Science, University of Wisconsin, 1989.
.\" .RE
.LP
[Moeller-Nielsen & Staunstrup 1987]
.RS
P. Moeller-Nielsen and J. Staunstrup.
Problem-Heap: A Paradigm for Multiprocessor Algorithms.
\fIParallel Computing 4\fR, North-Holland, 1987, pp. 63-74.
.RE
.\" .LP
.\" [Ousterhout 1984]
.\" .RS
.\" J. Ousterhout.
.\" Scheduling Techniques for Concurrent Systems.
.\" \fIProc. 3rd International Conference on Distributed Computing Systems\fR,
.\" 1984.
.\" .RE
.LP
[Pfister et al. 1985]
.RS
G. Pfister, W. Brantley, D. George, S. Harvey, W. Kleinfelder, K. McAuliffe,
E. Melton, V. Norton, and J. Weise. The IBM Research Parallel Processor
Prototype (RP3): Introduction and Architecture.
\fIProc. 1985 International Conference on Parallel Processing\fR,
August 1985.
.RE
.LP
[Rodgers 1986]
.RS
David P. Rodgers. Personal communication.
.RE
.LP
[Segall & Rudolph 1985]
.RS
Zary Segall and Larry Rudolph.
PIE: A Programming and Instrumentation Environment
for Parallel Processing.
\fIIEEE Software 2\fR,6 (November 1985).
.RE
.LP
[Sequent 1988]
.RS
Sequent Computer Systems, Inc.
Symmetry Technical Summary.
.RE
.LP
[Thacker et al. 1988]
.RS
Charles Thacker, Lawrence Stewart, and Edward Satterthwaite Jr.
Firefly: A Multiprocessor Workstation.
\fIIEEE Transactions on Computers 37\fR,8 (Aug. 1988), pp. 909-920.
.RE
.\" .LP
.\" [Tucker & Gupta 1989]
.\" .RS
.\" A. Tucker and A. Gupta.
.\" Process Control and Scheduling Issues for Multiprogrammed Shared Memory
.\" Multiprocessors.
.\" \fIProc. 12th ACM Symposium on Operating Systems Principles\fR,
.\" December 1989.
.\" .RE
.\" .LP
.\" [Weber & Gupta 1989]
.\" .RS
.\" W. Weber and A. Gupta.
.\" Analysis of Cache Invalidation Patterns in Multiprocessors.
.\" \fIProc. Third International Conference on Architectural
.\" Support for Programming Languages and Operating Systems (ASPLOS-III)\fR,
.\" April 1989, pp. 243-255.
.\" .RE
.LP
[Yang & Miller 1988]
.RS
Cui-Qing Yang and Barton Miller. Critical Path Analysis for the Execution
of Parallel and Distributed Programs. \fIProc. 9th International
Conference on Distributed Computing Systems\fR, pp. 366-373, June 1988.
.RE
.\" .PP
.\" This
.\" accounted for less than 2% of total processor time according
.\" to gprof \- an acceptable overhead.
.\" However, Quartz showed that this logging
.\" actually accounted for 25% of the program's
.\" runtime, because it occurred during the sequential initialization
.\" phase of the program.
.\" The importance of reflecting the program
.\" structure by reporting the work done
.\" on behalf of a procedure was demonstrated here,
.\" since the time was split among several
.\" lower-level procedures called by the logging function.
.\" When logging was removed, the program's speedup
.\" improved from 9.2 to 12.2.